(Bowers et al., 2013; Jerald, 2006; Gleason and Dynarski, 2002).
Most of the 110 at-risk flags found in the literature report either sensitivity or specificity, but rarely both (Bowers et al., 2013).
In an effort to bring cohesion and clarity to the comparison of at-risk flags, Bowers et al. (2013) calculated the performance metrics for 110 separate flags found in the literature.
Receiver Operating Characteristic
- ROC represents the tradeoff between the fraction of true non-graduates identified out of all non-graduates (sensitivity, or recall) and the fraction of graduates falsely flagged as non-graduates out of all graduates (the false positive rate)
- Can be summarized by the AUC (area under the curve) to support decision analysis about the balance between false positives and false negatives
- Excellent for optimizing rare-class identification
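The tradeoff above can be sketched in a few lines of Python. This is a minimal illustration with made-up labels and scores (not DEWS data): each threshold yields one point of (false positive rate, true positive rate), and the trapezoid rule gives the AUC.

```python
# Build an ROC curve for a risk score where 1 = non-graduate.
# Illustrative toy data only, not DEWS output.

def roc_points(labels, scores):
    """Return (fpr, tpr) points, one per score threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:      # a true non-graduate crosses the threshold
            tp += 1
        else:               # a graduate is falsely flagged
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0, 0, 1]                      # 1 = non-graduate
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]      # predicted risk
print(round(auc(roc_points(labels, scores)), 3))
```

A perfect flag would score every non-graduate above every graduate (AUC = 1.0); a coin flip averages 0.5.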
What is out there?
Adapted from Bowers, Sprott, and Taff 2013

What we really want to know
- A key limitation is that, in most cases, these are only measures of how the model performed in-sample
- We want to estimate the likely accuracy for students receiving predictions today.
- This is critical to inform stakeholders about the appropriate weight to give the evidence provided by the DEWS score.
- Estimating accuracy on future data is a key feature of machine learning.
Estimating out-of-sample error rates
- Unlike model fit statistics, such as AIC, BIC, and \(R^{2}\), out-of-sample fit statistics require re-estimation of the model.
- Select a hold-out data set with observed outcome data and evaluate all models on their performance with that data
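The hold-out idea can be sketched as follows. The data are simulated (not DEWS records): a cutoff is tuned on the training split only, and accuracy is then reported on a split that played no role in tuning, which is the out-of-sample estimate the slide describes.

```python
# Minimal sketch of hold-out evaluation on simulated (score, outcome) data.
import random

random.seed(1)
raw_scores = [random.random() for _ in range(400)]
# Higher score = higher true risk of the outcome.
data = [(s, 1 if random.random() < s else 0) for s in raw_scores]
train, holdout = data[:300], data[300:]

def accuracy(rows, cutoff):
    """Fraction of rows where (score >= cutoff) matches the outcome."""
    return sum((s >= cutoff) == (y == 1) for s, y in rows) / len(rows)

# Grid-search the cutoff on the training split only.
best_cutoff = max((c / 100 for c in range(1, 100)),
                  key=lambda c: accuracy(train, c))

print("train accuracy:  ", round(accuracy(train, best_cutoff), 3))
print("holdout accuracy:", round(accuracy(holdout, best_cutoff), 3))
```

The hold-out figure is typically a bit lower than the training figure, which is exactly the gap that in-sample statistics like AIC or \(R^{2}\) cannot reveal.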

Knowing the Context: Wisconsin
- Wisconsin has a very high graduation rate, but very deep racial and economic disparities (relative to other states).
- Wisconsin has hundreds of middle schools, but a small fraction of them account for the bulk of future high school non-completers.
- Wisconsin has no graduation test, low state graduation requirements (with substantial variation between districts), and a majority of school districts with a single high school.
- Wisconsin's coursework data are not of high enough quality to support many EWS analyses.
- At the state level, we have lots of observations but relatively few measures.
Wisconsin variables

Machine learning
- A checklist model will not work statewide because of delays in data availability and lack of key coursework measures.
- A regression model seems like a good starting place, but need to identify how to measure accuracy out-of-sample to inform stakeholders.
- Even small incremental gains in accuracy (at the cost of tens of hours at DPI) can save hundreds or thousands of hours in the field.
- Look at setting up a machine learning workflow to standardize this process
The DEWS workflow

A few words on data preparation
- Defining who is a non-completer and who is a completer is difficult but essential to success
- Predictors should be transformed – categories collapsed, numeric indicators centered and scaled, zero-variance data elements dropped.
- Worth an entire paper to discuss challenges here
Modeling
- Be pluralistic and test many models – the only marginal cost is CPU time
- Start with some basic models and build in complexity in search of increased accuracy (James et al., 2013)

Algorithm search
- Start with 35 candidate algorithms including logistic regression
- Select 40,000 training observations and use 10-fold cross validation to tune the parameters for each algorithm
- Calculate the AUC for each algorithm with the test data
- Toss out models with abnormally long run times for practical reasons
- Select the top 5 to 8 models by AUC or by algorithmic diversity
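The search loop can be sketched as below. The "candidates" are toy scoring rules standing in for the 35 real algorithms, the data are simulated, and the per-algorithm parameter-tuning inner loop is omitted; only the cross-validated AUC ranking is shown.

```python
# Rank candidate algorithms by 10-fold cross-validated AUC (toy version).
import random

random.seed(0)

def auc(rows, score):
    """Rank-based AUC of a scoring function over (features, label) rows."""
    pos = [score(x) for x, y in rows if y == 1]
    neg = [score(x) for x, y in rows if y == 0]
    wins = sum(p > q for p in pos for q in neg)
    ties = sum(p == q for p in pos for q in neg)
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy stand-ins for tuned algorithms; only x[0] carries signal.
candidates = {
    "signal":   lambda x: x[0],
    "noise":    lambda x: x[1],
    "constant": lambda x: 0.5,
}

xs = [(random.random(), random.random()) for _ in range(500)]
data = [(x, 1 if random.random() < x[0] else 0) for x in xs]

k = 10
folds = [data[i::k] for i in range(k)]
cv_auc = {name: sum(auc(fold, f) for fold in folds) / k
          for name, f in candidates.items()}
ranked = sorted(cv_auc, key=cv_auc.get, reverse=True)
print(ranked)
```

In the real workflow each candidate would be refit on the training folds with its own tuning grid; here the ranking step alone is the point.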
Ensemble
- Retrain the selected models with more data and a wider tuning parameter search
- Ensemble the models into a meta-model (using greedy optimization of the AUC)
- Calculate AUC on a validation data set (not the training or test set)
- Store meta-model for later scoring of new cases
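Greedy AUC optimization, as the slide describes, can be sketched in the Caruana style: repeatedly add whichever model's predictions most improve the blend's AUC, with replacement. The model names and predictions below are illustrative, not the DEWS component models.

```python
# Greedy ensemble selection: grow the blend one model at a time,
# always adding the model that maximizes the blended AUC.
def auc(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(p > q for p in pos for q in neg)
    ties = sum(p == q for p in pos for q in neg)
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def greedy_ensemble(labels, model_preds, rounds=10):
    """model_preds: dict name -> list of predicted probabilities."""
    chosen, blend = [], [0.0] * len(labels)
    for _ in range(rounds):
        def trial_auc(name):
            k = len(chosen) + 1
            mixed = [(b * len(chosen) + p) / k
                     for b, p in zip(blend, model_preds[name])]
            return auc(labels, mixed)
        best = max(model_preds, key=trial_auc)   # add with replacement
        chosen.append(best)
        blend = [(b * (len(chosen) - 1) + p) / len(chosen)
                 for b, p in zip(blend, model_preds[best])]
    return chosen, blend

labels = [1, 0, 1, 0, 1, 0]          # 1 = non-graduate (toy data)
model_preds = {
    "gbm":    [0.9, 0.4, 0.6, 0.5, 0.7, 0.2],
    "logit":  [0.8, 0.6, 0.5, 0.1, 0.9, 0.3],
    "random": [0.2, 0.9, 0.4, 0.8, 0.1, 0.7],
}
chosen, blend = greedy_ensemble(labels, model_preds, rounds=5)
print(chosen, round(auc(labels, blend), 3))
```

Because models are added with replacement, the final blend is an implicit weighting of the components, and a weak model is simply never selected.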
Why ensemble?
Probit model performance

Individual algorithm model performance

Ensemble model performance

Performance with test data

Prediction
- The whole ballgame is to make predictions on current students at scale.
- Need to assign them to a category (low, medium, high risk) based on predicted probabilities
- Calculate an optimal probability threshold by consulting with content experts
- In Wisconsin we determined that between 10 and 25 false positives would be acceptable to avoid one false negative.
- The later the intervention (and prediction grade), the less acceptable false positives become.
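The stakeholder tradeoff above can be turned into a cutoff mechanically: if one missed non-graduate (false negative) costs as much as `r` false alarms, pick the threshold minimizing FP + r × FN. The data below are illustrative, and this cost-minimization framing is a sketch of the idea, not necessarily the exact DEWS procedure.

```python
# Choose a probability cutoff from a stated false-positive /
# false-negative exchange rate r (toy data, 1 = non-graduate).
def best_threshold(labels, scores, r):
    """Return the cutoff minimizing FP + r * FN; higher score = more risk."""
    def cost(c):
        fp = sum(s >= c and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < c and y == 1 for s, y in zip(scores, labels))
        return fp + r * fn
    return min(sorted(set(scores)), key=cost)

labels = [1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.6, 0.55, 0.5, 0.45, 0.3, 0.2, 0.1]

# A costly miss (r = 25) pushes the cutoff down, flagging aggressively;
# a cheap miss (r = 2) pushes it up, tolerating fewer false alarms.
print(best_threshold(labels, scores, r=25))   # low cutoff
print(best_threshold(labels, scores, r=2))    # high cutoff
```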
Combined test and training results for grade 7
|                 | Obs. Grad. | Obs. Non-Grad. |
|-----------------|-----------:|---------------:|
| Pred. Grad.     |     84,744 |          3,670 |
| Pred. Non-Grad. |     13,718 |          7,454 |
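Reading the grade-7 table above with non-graduation as the positive class (and assuming the columns are observed graduates, then observed non-graduates), the standard metrics fall out directly:

```python
# Metrics from the combined grade-7 confusion table above.
tn, fn = 84_744, 3_670    # predicted-graduate row
fp, tp = 13_718, 7_454    # predicted-non-graduate row

sensitivity = tp / (tp + fn)   # share of non-graduates correctly flagged
specificity = tn / (tn + fp)   # share of graduates left unflagged
precision   = tp / (tp + fp)   # share of flagged students who do not graduate

print(round(sensitivity, 3), round(specificity, 3), round(precision, 3))
```

Note how the state's high graduation rate depresses precision even with decent sensitivity and specificity: the needle-in-a-haystack problem described later in these slides.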
Challenges with machine learning
- Hard to interpret results
- Difficult to get stability
- Concerns with the "black box"
Overcoming challenges
- Communication, communication, communication
- Everyone agrees that accuracy is the priority, so the added complexity is justified
- Find ways to make complexity approachable and trustworthy
- Transparency is key
Peeking inside the black box

Implementation
- Not worth anything if predictions do not reach educators
- Use state reporting tool to disseminate results + a rollout plan
- Break score into risk bands of low, moderate, and high
- Construct subscores based on malleable student factors of academics, attendance, mobility, and behavior
Student Profile

DEWS Box Zoom

Future work
- Make predictions more user-friendly and easier to interpret for educators
- Make DEWS automated enough for IT staff to execute predictions each time new data is loaded
- Test ensembling methods more thoroughly for increased accuracy
- Include rare-class (non-graduate) oversampling to better identify different types of non-graduates
- Automatically set threshold for low, moderate, high risk status
- Blend a localized prediction with prediction from the state model for each district/school
Contact Info
Additional Slides
Algorithms Chosen By Grade

The What of Early Warning Systems
- EWSs are common in many industries and go by a number of other names – predictive analytics, risk models, or machine learning.
- Schools currently do a lot of work around identifying students-at-risk in response to federal and state laws and definitions.
- EWS has traditionally been thought of as a high school tool, but is increasingly being introduced into middle school and earlier (Balfanz and Herzog 2006).
- Existing EWS models fall into three broad categories – checklist, regression, and mixture/latent variable models
What does this mean for Wisconsin's system?
- Trying to take advantage of economies of scale – one big accurate analysis disseminated statewide
- Unfortunately working statewide means we have a lot of observations, but a dearth of measures
- Context matters greatly and modeling strategies need to be able to reflect this context
- WI has a high graduation rate, so there is a needle-in-a-haystack problem